A machine learning approach to economic complexity based on matrix completion

This work applies Matrix Completion (MC), a class of machine-learning methods commonly used in recommendation systems, to analyze economic complexity. In this paper MC is applied to reconstruct the Revealed Comparative Advantage (RCA) matrix, whose elements express the relative advantage of countries in given classes of products, as evidenced by yearly trade flows. A high-accuracy binary classifier is derived from the MC application to discriminate between elements of the RCA matrix that are, respectively, higher or lower than one. We introduce a novel Matrix cOmpletion iNdex of Economic complexitY (MONEY) based on MC and related to the degree of predictability of the RCA entries of different countries (the lower the predictability, the higher the complexity). Unlike previously developed economic complexity indices, MONEY takes into account several singular vectors of the matrix reconstructed by MC. In contrast, other indices are based only on one or two eigenvectors of a suitable symmetric matrix derived from the RCA matrix. Finally, MC is compared with state-of-the-art economic complexity indices, showing that the MC-based classifier achieves better performance than previous methods based on the application of machine learning to economic complexity.

Since the early 2000s, building metrics for measuring economic complexity has been a set goal. Starting from the Economic Complexity Index (ECI) developed by Hidalgo and Hausmann 1 , it has become clear that most traditional economic growth theories oversimplified the internal socio-economic dynamics of countries through strict assumptions, restricting the analysis to a small subset of pre-determined factors. Unlike traditional growth theories, economic complexity measures are based on a data-driven approach and are generally agnostic about the determinants of countries' competitiveness. For instance, the ECI seeks to explain the knowledge accumulated by a country and expressed in all the economic activities present in that country. Increasingly refined measures of economic complexity have become available in the last few years. In a recent review, Hidalgo 2 identifies two main streams of the literature on economic complexity: the first involves metrics of so-called relatedness, whereas the second concerns economic complexity metrics, which apply dimensionality reduction techniques based, e.g., on Singular Value Decomposition (SVD). Metrics of relatedness measure the affinity between an activity and a location, while methods related to dimensionality reduction search for the best combination of factors explaining the structure of a given specialization matrix.
According to the principle of relatedness, the probability that a location c (e.g., a country) enters or exits an economic activity p (e.g., a sector) is influenced by the presence of related activities in that location. This raises, however, deeper questions about the role played by similar countries in determining the likelihood that the location c enters the economic activity p. Furthermore, while the principle of relatedness attempts to model the probability of entering an activity p by the location c, it does not provide hints about whether c will enter p successfully or not. Besides, there is a strong connection between the concept of production function (a function connecting economic inputs to outputs) and economic complexity via the SVD of a suitable specialization matrix. Indeed, as discussed in Hidalgo 2 , the Cobb-Douglas production function of a given set of outputs might be expressed in terms of the SVD of the specialization matrix. In particular, in that context, the singular vectors represent the factors of production employed to produce a given set of outputs. Having more of such factors helps to explain the outputs better. A similar idea applies to the case of economic complexity, for which SVD is used to learn the singular vectors (factors) that best explain the structure of a specialization matrix $R \in \mathbb{R}^{C \times P}$, where C is the number of countries considered in the analysis, and P is the number of products examined (at a given aggregation level). The ECI index is closely related to the leading singular vectors of that specialization matrix (Hidalgo 2 ), i.e., to a truncated SVD of the matrix. These are also the leading eigenvectors of the product of the specialization matrix with its transpose. Usually, scholars select one of the first two eigenvectors (i.e., the ones associated with the two largest eigenvalues) because it carries the maximum amount of information.
However, this approach has some important drawbacks, as discussed in Morrison et al. 3 . Recently, Sciarra et al. 4 combined information coming from the first two eigenvectors into a unique index called GENeralised Economic comPlexitY (GENEPY). Nevertheless, it is worth noticing that, by doing this, the other eigenvectors are neglected and, together with them, further information which could potentially better explain economic complexity. Therefore, it seems reasonable to explore a suitable way to carefully select other informative eigenvectors (or, more generally, singular vectors) beyond the first two. In this respect, the present paper exploits a class of machine-learning methods called Matrix Completion (MC) to extract the information coming from a subset of entries of the specialization matrix, in order to predict the remaining entries of that matrix. Such information is encoded by the singular values and singular vectors of a suitable "approximating matrix" constructed automatically by MC. The approach adopted here differs from (truncated) SVD (Hidalgo 2 ), because in our framework the specialization matrix is assumed to be only partially observed: a given subset of matrix entries, provided as inputs to the learning machine in its training phase, is used to predict other portions of the matrix, which are initially artificially hidden from the learning machine. In a second phase, the latter entries are used as a ground truth, for validation and testing purposes. Moreover, we exploit the difficulty in predicting via MC some entries of the specialization matrix to quantify, in an aggregate way, the set of unobservables that make a country more competitive than expected. Such unobservables form the unexplained part of the specialization matrix.
For the first time in the literature on economic complexity, a measure of the degree of predictability via MC of the entries of the specialization matrix corresponding to different countries is then used to define a novel index of economic complexity for countries (our proposed framework can, however, be easily extended to the case of products). We have been inspired by the Total Factor Productivity (TFP) measure in economic growth models. Abramovitz 5 called TFP "a measure of our ignorance" since it is what is left unexplained by growth in the inputs of the production function. More precisely, our main idea is to adopt MC to infer information about the Revealed Comparative Advantage (RCA), or disadvantage, of a country in a given trade category of products. Such information is collected, for each year, in a matrix $\mathrm{RCA} \in \mathbb{R}^{C \times P}$. In formulas, one has

$$\mathrm{RCA}_{c,p} = \frac{D_{c,p} \big/ \sum_{p'} D_{c,p'}}{\sum_{c'} D_{c',p} \big/ \sum_{c',p'} D_{c',p'}}, \qquad (1)$$

where $D_{c,p}$ is the value in international dollars of the exports of the product p by country c. In case one among $D_{c,p}$, $D_{c,p'}$, $D_{c',p}$, and $D_{c',p'}$ in Eq. (1) is not available, one gets $\mathrm{RCA}_{c,p} = \mathrm{NaN}$. In this case, as a pre-processing step, that $\mathrm{RCA}_{c,p}$ value can be replaced by 0. In order to extract topological information from the RCA matrix, it is common in the literature (see, e.g., Sciarra et al. 4 ) to consider also the associated incidence matrix $M \in \mathbb{R}^{C \times P}$, whose entries are defined as follows:

$$M_{c,p} = \begin{cases} 1, & \text{if } \mathrm{RCA}_{c,p} \geq 1, \\ 0, & \text{otherwise}. \end{cases} \qquad (2)$$

In the paper, MC is applied several times (starting from different training subsets of suitably discretized RCA values associated with several countries and products, excluding originally NaN values) to estimate the expected RCA values of pairs of countries c and products p that have not been used in the training phase. To fulfill this task, the adopted MC technique is based on a soft-thresholded SVD, which selects each time, via a suitable regularization technique, the subset of most informative singular values and corresponding singular vectors.
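The computation of the RCA matrix and of the associated incidence matrix can be sketched as follows. This is a minimal illustration with NumPy, not the authors' code: the function names `rca_matrix` and `incidence_matrix` are hypothetical, and the treatment of missing exports (they are ignored in the totals, and only the corresponding RCA entry becomes NaN) is an assumption made for the example.

```python
import numpy as np

def rca_matrix(D):
    """RCA from a (countries x products) export matrix D.

    Assumption: missing exports are encoded as NaN and are skipped in the
    row, column, and world totals; the RCA of a missing entry stays NaN.
    """
    D = np.asarray(D, dtype=float)
    country_tot = np.nansum(D, axis=1, keepdims=True)  # total exports of each country
    product_tot = np.nansum(D, axis=0, keepdims=True)  # world exports of each product
    world_tot = np.nansum(D)                           # total world exports
    with np.errstate(divide="ignore", invalid="ignore"):
        return (D / country_tot) / (product_tot / world_tot)

def incidence_matrix(rca):
    """Binary incidence matrix: 1 where RCA >= 1, 0 elsewhere (NaN treated as 0)."""
    rca0 = np.where(np.isnan(rca), 0.0, rca)  # pre-processing step: NaN -> 0
    return (rca0 >= 1.0).astype(int)

# Toy data: 2 countries, 2 products (values are purely illustrative).
D_toy = np.array([[10.0, 0.0],
                  [10.0, 20.0]])
M_toy = incidence_matrix(rca_matrix(D_toy))
```

In the toy example, the first country exports only the first product, so it is relatively specialized in it, while the second country is relatively specialized in the second product.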
The work contributes to the literature on economic complexity in three ways: (i) it applies for the first time MC to assess the complexity of countries and to predict the evolution of international trade; (ii) it defines a novel index of economic complexity based on MC; (iii) it builds up a comparison with state-of-the-art indices of economic complexity, revealing a high correlation between the output of GENEPY when it is applied to the original incidence matrix and the false positive rate of a binary classifier derived by the repeated application of MC. The results of our analysis show that MC performs well in estimating the RCA of countries. Supported by the high-quality predictions of MC, we propose a novel Matrix cOmpletion iNdex of Economic complexitY (MONEY) for countries, which exploits the accuracy of their RCA predictions derived from the repeated application of MC. Such accuracy is expressed in terms of a suitably weighted Receiver Operating Characteristic (ROC) Area Under the Curve (AUC), one for each country examined. The MONEY index ranks countries according to their degree of predictability, taking into account also the complexity of the products. Specifically, the larger the AUC for a specific country and the larger the average, over a subset of the products of that country, of the MC performance in estimating the discretized RCA values of country-product pairs, the less complex that country. Using MC to construct the proposed index helps to solve one shortcoming of other economic complexity measures, i.e., the fact that, unlike MC, they take into account only the information coming from the leading eigenvectors. For instance, GENEPY and our proposed application of MC differ in various aspects, even in the particular case in which the specialization matrix has two large singular values followed by other much smaller ones.
Indeed, GENEPY takes as input the whole specialization matrix, transforms it into a symmetric matrix, and finds the two largest eigenvalues of the latter matrix. By contrast, in the present context, MC is applied several times, for different partitions of the specific specialization matrix into training/validation/test sets. Moreover, for each repetition of MC, the specialization matrix is only partially observed, in such a way that only an approximate SVD can be obtained for it by MC. In addition, the GENEPY index computed using the MC surrogate incidence matrix reveals interesting discrepancies in terms of economic complexity with respect to the original GENEPY, i.e., the one calculated starting from the incidence matrix associated with the observed RCA matrix. Finally, we show that our MC-based classifier has better prediction performance than the machine learning methods proposed in Tacchella et al. 6 , which, to the best of our knowledge, is the only work exploiting machine learning to analyze economic complexity and international trade.

Predicting the economic complexity: a matrix completion approach
In this work, we apply Matrix Completion (MC) techniques to study economic complexity. This class of machine-learning methods has been popularized by the so-called Netflix competition; see the Supplementary material for further details on MC and Hastie et al. 7 , Alfakih et al. 8 , and Cai et al. 9 for some of its applications. To illustrate the potential of MC in the analysis of economic complexity, in this paper we use MC to estimate the expected Revealed Comparative Advantage (RCA) of countries c and products p. However, our approach is general and can be applied to different locations (e.g., firms, cities or regions) and to other activities (e.g., patents, scientific production, skills). The specific MC method adopted in the paper consists in completing a partially observed matrix $A \in \mathbb{R}^{C \times P}$ (which is derived from the RCA matrix in our case, and represents the specialization matrix in our context), by minimizing a suitable trade-off between the reconstruction error of the known portion of that matrix and a penalty term, which penalizes a high nuclear norm of the reconstructed (or completed) matrix. This is formulated via the following optimization problem (Mazumder et al. 10 ):

$$\min_{Z \in \mathbb{R}^{C \times P}} \; \frac{1}{2} \sum_{(c,p) \in \Omega^{\mathrm{tr}}} \left( A_{c,p} - Z_{c,p} \right)^2 + \lambda \, \|Z\|_{*}, \qquad (3)$$

where $\Omega^{\mathrm{tr}}$ is a training subset of pairs of indices (c, p) corresponding to positions of known entries of the partially observed matrix $A \in \mathbb{R}^{C \times P}$, $Z \in \mathbb{R}^{C \times P}$ is the completed matrix (to be optimized), $\lambda \geq 0$ is a regularization constant (chosen by a suitable validation method), and $\|Z\|_{*}$ is the nuclear norm of the matrix Z, i.e., the sum of all its singular values. A possible state-of-the-art iterative algorithm to solve the optimization problem (3) is called Soft Impute (Mazumder et al. 10 ). Its main idea consists in replacing iteratively (until convergence) the unobserved elements of the matrix A with those imputed by a soft-thresholded SVD. The reader is referred to the Supplementary material for further technical details on the optimization problem (3) and on the Soft Impute algorithm.
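The iteration described above (fill the unobserved entries with the current estimate, then apply a soft-thresholded SVD) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `soft_impute`, the zero initialization, the stopping rule, and the default values of `n_iter` and `tol` are all choices made for the example.

```python
import numpy as np

def soft_impute(A, observed, lam, n_iter=200, tol=1e-6):
    """Sketch of a Soft Impute iteration for the nuclear-norm problem.

    A        : matrix with arbitrary values (even NaN) at unobserved positions
    observed : boolean mask, True where the entry of A is known (training set)
    lam      : regularization constant (lambda >= 0)
    """
    A = np.asarray(A, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    Z = np.zeros_like(A)  # assumed starting point
    for _ in range(n_iter):
        # Keep the observed entries of A, fill the others with the current estimate.
        filled = np.where(observed, A, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_thr = np.maximum(s - lam, 0.0)  # soft-threshold the singular values
        Z_new = (U * s_thr) @ Vt
        # Relative change between successive iterates (assumed stopping rule).
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z
```

For small values of `lam`, a plain iteration of this kind can converge slowly; Mazumder et al. discuss warm starts along a decreasing sequence of regularization constants, which this sketch omits for brevity.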
While MC has already found many applications in several fields (e.g., movie recommendation, sensor engineering, econometrics), to the best of our knowledge, this is the first time it is used to analyze economic complexity. More precisely, we applied MC to construct surrogates of a specialization matrix and to define a novel complexity index, then we compared the obtained results with the ones provided by other state-of-the-art complexity indices.
In the following, we describe our approach of applying MC to the reconstruction of the RCA matrix for the case in which the products were aggregated at the 4-digit level in the Harmonized System Codes 1992 (HS-1992). In the Supplementary material we provide results at different levels of aggregation of trade (2 and 6 digits). Consistently with the literature (Sciarra et al. 11 ), we constructed the matrix A (one of the inputs to the optimization problem (3)) by discretizing the elements of the RCA matrix (see the Supplementary material for details on its construction). For the sake of brevity, in the following we describe in detail only the MC application to the definition of a measure of complexity of the locations (i.e., countries in our case). To get a measure of complexity of the activities (i.e., products), it is enough to replace the matrix A with its transpose (see also the Supplementary material for some related results).
1. For the matrix $A \in \mathbb{R}^{C \times P}$ (where C = 119 is the number of countries, and P = 1243 is the number of products), the MC optimization problem (3) was solved N = 1000 times by the Soft Impute algorithm, based on various choices for the training/validation/test sets (and, as already mentioned, for the regularization parameter $\lambda$).

2. For each such repetition $n = 1, \ldots, N$, the sets above were constructed as follows. First, a (pseudo)random permutation of the rows of A was generated. Then, a subset $S_n$ of these rows was considered, by including in it the first row in the permutation and the successive s% ≃ 25% rows. In this way, the resulting number of elements of the set $S_n$ was $|S_n| = 30$. Next, for each row in $S_n$, its elements belonging to all the groups except group "0" (associated with originally NaN RCA values) were obscured independently with probability $p_{\mathrm{missing}} = 0.3$ (see the Supplementary material for a robustness check of the results of the analysis with respect to the choice of $p_{\mathrm{missing}}$). The (indices of the) remaining entries of the matrix A (excluding the ones belonging to the group "0") formed the training set (denoted by $\Omega^{\mathrm{tr}_n}$). The obscured entries in one of the $|S_n|$ rows (say, row $h \in \{1, \ldots, |S_n|\}$) formed the test set (denoted by $\Omega^{\mathrm{test}_{n,h}}$), whereas the obscured entries in the remaining $|S_n| - 1$ rows formed the validation set (denoted by $\Omega^{\mathrm{val}_{n,h}}$).

3. For each repetition n, the generation of the validation and test sets from the set $S_n$ was made $|S_n|$ times, each time with a different selection of the row h associated with the test set (and, as a consequence, also of the $|S_n| - 1$ rows associated with the validation set). Hence, the same training set was associated with $|S_n|$ different pairs of validation and test sets (the number of repetitions N = 1000 and the percentage s% ≃ 25% were selected in order to associate each row with the test set a sufficiently large number of times, with high probability; in particular, with these choices, the average number of times each row was associated with the test set was about 250). In this way, for each choice of $S_n$ and of the regularization parameter $\lambda$, the MC optimization problem (3) was solved once instead of $|S_n|$ times, thus improving the computational efficiency. Finally, by construction, each time there was no overlap between the training, validation, and test sets.

4. To avoid overfitting, for each choice of the training set $\Omega^{\mathrm{tr}_n}$, the optimization problem (3) was solved for 30 choices $\lambda_k$ for $\lambda$, exponentially distributed as $\lambda_k = 2^{(k-1)/2}$ for $k = 1, \ldots, 30$. The resulting completed (and post-processed, see the Supplementary material) matrix was indicated as $Z^{(n)}_{\lambda_k}$. For each $\lambda_k$ and each of the $|S_n|$ selections of the validation sets associated with the same training set, the Root Mean Square Error (RMSE) of matrix reconstruction on that validation set was computed as

$$\mathrm{RMSE}^{\mathrm{val}_{n,h}}_{\lambda_k} = \sqrt{\frac{1}{|\Omega^{\mathrm{val}_{n,h}}|} \sum_{(c,p) \in \Omega^{\mathrm{val}_{n,h}}} \left( A_{c,p} - Z^{(n)}_{\lambda_k, c,p} \right)^2},$$

then the choice $\lambda_{k^{\circ}(n,h)}$ minimizing $\mathrm{RMSE}^{\mathrm{val}_{n,h}}_{\lambda_k}$ for $k = 1, \ldots, 30$ was found (see the Supplementary material for an example of computation of $\lambda_{k^{\circ}(n,h)}$). Finally, the RMSE of matrix reconstruction on the related test set was computed in correspondence of the so-obtained optimal value $\lambda_{k^{\circ}(n,h)}$ as

$$\mathrm{RMSE}^{\mathrm{test}_{n,h}}_{\lambda_{k^{\circ}(n,h)}} = \sqrt{\frac{1}{|\Omega^{\mathrm{test}_{n,h}}|} \sum_{(c,p) \in \Omega^{\mathrm{test}_{n,h}}} \left( A_{c,p} - Z^{(n)}_{\lambda_{k^{\circ}(n,h)}, c,p} \right)^2}.$$

5. For each choice of n and h, the MC predictions contained in the matrix $Z^{(n)}_{\lambda_{k^{\circ}(n,h)}}$ were used to build a binary classifier. More precisely, each time an element $A_{c,p}$ of the matrix A was in the test set, such element was attributed to the class 0 (corresponding to the case $0 \leq \mathrm{RCA} < 1$) when its MC prediction from $Z^{(n)}_{\lambda_{k^{\circ}(n,h)}}$ was lower than 0, otherwise it was attributed to the class 1 (corresponding to the case $\mathrm{RCA} \geq 1$). Finally, the average classification of the element $A_{c,p}$ (with respect to all the test sets to which that element belonged) was indicated as $\bar{A}^{(\mathrm{MC})}_{c,p}$, whereas its most frequent classification (either 0 or 1) was indicated as $\hat{A}^{(\mathrm{MC})}_{c,p}$. A random assignment between 0 and 1 was made to deal with ties. In the (unlikely) case the element $A_{c,p}$ appeared in none of the test sets, both $\bar{A}^{(\mathrm{MC})}_{c,p}$ and $\hat{A}^{(\mathrm{MC})}_{c,p}$ were chosen to be equal to 0 (due to the choice $p_{\mathrm{missing}} = 0.3$, each element $A_{c,p}$ not associated with the group "0" appeared in the test set on average about 75 times; so, the probability that one such element appeared in none of the test sets was negligible).

6. A first MC surrogate $\bar{M}^{(\mathrm{MC})} \in \mathbb{R}^{119 \times 1243}$ of the incidence matrix M was defined as follows: [...] Similarly, a second MC surrogate $\hat{M}^{(\mathrm{MC})} \in \mathbb{R}^{119 \times 1243}$ of the incidence matrix M was defined as follows: [...]

7. Finally, in order to assess the prediction capability of the binary classifier associated with MC (see Step 5 above), for each row (country) c of A, we also computed the false positive rate $\mathrm{fpr}_c$ and the false negative rate $\mathrm{fnr}_c$ as the average classification error frequency, respectively, of the true negative/true positive examples in all the test sets associated with that row (where the "negative class" refers to the class 0 associated with $0 \leq \mathrm{RCA} < 1$, and the "positive class" to the class 1 associated with $\mathrm{RCA} \geq 1$).
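The construction of the training/validation/test sets and the validation-based choice of the regularization constant described in the steps above can be sketched as follows. This is an illustrative outline, not the authors' code: the function names, the boolean mask `group0` marking the originally NaN entries, and the rounding used to obtain 30 rows out of 119 are assumptions, and the list `completions` stands for the completed matrices returned by any MC solver for the different regularization constants.

```python
import numpy as np

def make_split(A, group0, rng, frac=0.25, p_missing=0.3):
    """One repetition: choose the rows S_n, obscure entries, build the training mask."""
    C, P = A.shape
    n_rows = int(round(frac * C))            # assumption: 30 rows when C = 119
    S = rng.permutation(C)[:n_rows]          # first rows of a random permutation
    obscured = np.zeros((C, P), dtype=bool)
    obscured[S] = rng.random((n_rows, P)) < p_missing
    obscured &= ~group0                      # group "0" entries are never obscured
    train = ~obscured & ~group0              # all remaining entries outside group "0"
    return S, obscured, train

def lambda_grid(K=30):
    """Regularization constants 2^((k-1)/2) for k = 1, ..., K."""
    return np.array([2.0 ** ((k - 1) / 2) for k in range(1, K + 1)])

def rmse(A, Z, mask):
    """Root mean square reconstruction error on the entries selected by mask.

    Returns NaN (with a NumPy warning) if the mask selects no entry.
    """
    return float(np.sqrt(np.mean((A[mask] - Z[mask]) ** 2)))

def select_lambda(A, completions, obscured, S, h):
    """Row S[h] gives the test set; the other obscured rows give the validation set."""
    test = np.zeros_like(obscured)
    test[S[h]] = obscured[S[h]]
    val = obscured & ~test
    val_err = [rmse(A, Z, val) for Z in completions]
    k_best = int(np.argmin(val_err))         # index minimizing the validation RMSE
    return k_best, rmse(A, completions[k_best], test)
```

Because the training mask does not depend on h, the completed matrices can be computed once per repetition and reused for all the choices of the test row, which mirrors the efficiency argument made in Step 3.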
Concluding, in our application of MC to economic complexity, the MC optimization problem (3) [...] to be compared with the original GENEPY index, which is obtained by taking as input to the GENEPY algorithm the original incidence matrix M.

The matrix completion index of economic complexity (MONEY)
In this section, we introduce our proposed economic complexity index, called Matrix cOmpletion iNdex of Economic complexitY (MONEY), whose construction is based on MC.
The MONEY index is built starting from the matrix $M^{(\mathrm{MC})}$: for each country c, the area under the ROC curve of the binary classifier associated with that country is denoted as $\mathrm{AUC}_c$. We remind the reader that, for a binary classifier, the ROC curve expresses the trade-off between fall-out (false positive rate) and sensitivity (true positive rate) of that classifier (both computed on the test set), as a function of its threshold. The true positive rate is equal to 1 minus the false negative rate. In general, ROC curves closer to the top-left corner indicate a better performance. As a baseline, a random guessing binary classifier is associated with a ROC curve with points lying along the diagonal indicated, e.g., later in Fig. 1a (for which the true positive rate is equal to the false positive rate). The closer a ROC curve is to the diagonal in the ROC space, the worse the performance of the associated binary classifier. ROC curves do not depend on class frequencies. This makes them useful for evaluating classifiers predicting rare events, as in the case of very high RCA values. The AUC measures the area of the entire two-dimensional region underneath the ROC curve, from (0, 0) to (1, 1), so that a random guessing classifier has an AUC equal to 0.5. The AUC is exploited in the literature to provide an aggregate measure of performance across all possible classification thresholds. Formally, it represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, assuming that "positive" ranks higher than "negative" (Fawcett 12 ). In more detail, for each country c, the elements of the c-th row of the matrix $M^{(\mathrm{MC})}$ are compared with a threshold to construct the associated binary classifier. The elements belonging to the same row of the original incidence matrix M are taken as ground truth. The discrimination threshold is varied from 0 to 1, using a step size equal to 0.01.
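The threshold sweep described above can be sketched as follows for a single country (one row of scores and the corresponding ground-truth labels). This is a minimal illustration, not the authors' code: the function names are hypothetical, the convention that a score greater than or equal to the threshold is assigned to class 1 is an assumption, and the area is obtained with the trapezoidal rule after adding the end points (0, 0) and (1, 1).

```python
import numpy as np

def roc_points(scores, labels, step=0.01):
    """False/true positive rates for thresholds 0, step, 2*step, ..., 1.

    Assumes that both classes are present in labels (otherwise a rate is undefined).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = labels == 1, labels == 0
    thresholds = np.arange(0.0, 1.0 + step / 2, step)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t                      # assumed convention for class 1
        tpr.append((pred & pos).sum() / pos.sum())
        fpr.append((pred & neg).sum() / neg.sum())
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    f = np.concatenate(([0.0], fpr, [1.0]))
    t = np.concatenate(([0.0], tpr, [1.0]))
    order = np.lexsort((t, f))                  # sort by fpr, then by tpr
    f, t = f[order], t[order]
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))
```

With constant scores the sweep produces only the points (0, 0) and (1, 1), so the resulting area is 0.5, in agreement with the random-guessing baseline on the diagonal.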
All the elements of $M^{(\mathrm{MC})}$ are used as dataset, except those with the same indices as the originally NaN values in the RCA matrix. This makes it possible to form a binary classifier for each threshold and for each country. The idea now is to exploit the $\mathrm{AUC}_c$ of the binary classifiers associated with the countries in order to provide a measure of complexity of such countries, based on the degree of predictability of the corresponding rows. Specifically, countries with lower $\mathrm{AUC}_c$ may be considered as more complex, since it is harder for MC to predict their RCA entries. The $\mathrm{AUC}_c$ alone, however, does not capture the reasons why MC performed poorly (or, vice versa, adequately). As an example, consider the three following hypothetical scenarios. Assume that MC performed poorly on a country c by attributing RCA ≥ 1 to a product p when its true RCA was smaller than 1, and correctly assigned an RCA smaller than 1 to all the other countries for the same product p (Scenario 1). Consider now the two following similar scenarios for which, for the same product p and the same country c, MC still performed poorly on c by wrongly attributing RCA ≥ 1 to p, and it attributed either correctly (Scenario 2) or incorrectly (Scenario 3) RCA ≥ 1 to all the other countries for the same p. All other things being equal, the country c is reasonably more complex in Scenario 1 than in Scenario 2. In fact, while in Scenario 2 MC could have been driven to predict, for country c, an RCA of p larger than or equal to 1 by the presence of several RCA entries larger than or equal to 1 for the other countries, this is not the case for Scenario 1. Finally, in Scenario 3, it is not possible to conclude that country c is more complex than the other countries, since MC is wrongly attributing RCA ≥ 1 to p for all such countries.
Nevertheless, such a scenario is quite unlikely to occur (as shown later in the article, MC typically has a quite satisfactory prediction capability in its specific application to the discretized RCA matrix). The example above suggests that, by adopting the $\mathrm{AUC}_c$ alone as a complexity measure, country c would be classified as equally complex in Scenarios 1, 2 and 3 (assuming the $\mathrm{AUC}_c$ to be equal in all these cases). In order to correct for this, we propose a refined complexity measure, based on weighting the $\mathrm{AUC}_c$ for each country c. The rationale of the proposed complexity measure is that not only are less predictable countries (according to MC) more complex, but one should also take into account the product dimension when comparing the MC predictions obtained for different countries, controlling for the quality of each prediction. More precisely, it is proposed to associate a weight $w_c$ to each country c, which is constructed in such a way that the $\mathrm{AUC}_c$'s of countries with a higher share of "rare" false positives are weighted less (since they are less predictable). In more detail, the proposed complexity measure is constructed as follows.
(a) First, the MC analysis made for the countries is repeated for the products, still referring to the same year. This is obtained simply by replacing at the beginning of the analysis the RCA matrix with its transpose.

(b), (c) [...] and $M^{\top}$, restricted to the entries associated with that product) and $\frac{N_p}{N_p + P_p}$ the proportion of entries with true RCA < 1 with respect to all the entries associated with that product (i.e., 119).

(d) Moreover, the average of $\mathrm{ftot}_{p,t}$ with respect to $t \in T$ is computed.

(e) Then, for each country c, one computes both $\mathrm{AUC}_c$ and a weight $w_c$, which is defined as follows: [...] In other words, for each country c, the weight $w_c$ is the average of $\mathrm{ftot}_p$ with respect to all the products p for which one predicts RCA ≥ 1 through the MC surrogate incidence matrix $(M^{\top})^{(\mathrm{MC})}$.

(f) Finally, the MONEY index for each country c is computed as: [...]

Results
Global performance of matrix completion. In this section we report some summary statistics of the prediction performance of MC. As in the section above, the matrix $M^{(\mathrm{MC})}$ was combined with a threshold to construct a binary classifier (in this case, however, unlike in that section, the threshold did not depend on the country). The discrimination threshold was varied from 0 to 1, using a step size equal to 0.01. All the elements of $M^{(\mathrm{MC})}$ were used as dataset, except the ones having the same indices as the originally NaN values in the RCA matrix. The ground truth was provided by the corresponding elements of the original incidence matrix M. Figure 1a shows the resulting ROC curve. The global AUC for the matrix $M^{(\mathrm{MC})}$ when compared to the real-world matrix of year 2018 is 0.81. Similarly, ROC curves for selected countries are displayed in Fig. 1b. As is evident from Fig. 1, MC performed quite well on average both globally and for developed countries such as Japan, United States and Germany. Its performance was poorer (though still above the baseline) for countries that either provided less information on their trade flows or whose trade flows were extremely volatile (i.e., they alternated between products with extremely high RCA values and products with very low RCA values). Specifically, $\mathrm{fnr}_c$ was higher for the latter countries. Nonetheless, the average performance of MC over all the countries was high, as shown by the AUC reported in Fig. 1a. As a further check, since the positive and negative labels were unbalanced in the original dataset (specifically, entries with RCA < 1 represented almost 70% of the entire dataset), we also computed the Balanced Accuracy (BACC) index, which turned out to be 0.75 (with a threshold on $M^{(\mathrm{MC})}$ equal to 0.5). We recall that the BACC is a performance metric designed for binary classifiers in the case of unbalanced datasets.
It is calculated as the average of the proportions of correctly classified elements of each class, and ranges from 0 (low balanced accuracy) to 1 (maximum balanced accuracy). Formally, it is equal to $(\mathrm{tpr} + \mathrm{tnr})/2$, where t stands for "true". An alternative index used for unbalanced datasets is the F1 score, which is defined as $F1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$. The F1 score of MC at the HS-4 level [...] is 0.74. In the Supplementary material we repeated the analysis at a more refined level of aggregation of world trade (HS-6 level). At the HS-6 level the global AUC reduced to 0.72 and the best F1 score of MC turned out to be 0.73. This is a good performance considering that no other feature of products and countries was used for predictions but various subsets of elements of the matrix A, given as inputs to MC.

Figure 2 displays the original incidence matrix M as compared to the MC surrogate incidence matrix $M^{(\mathrm{MC})}$ obtained at the HS-4 level of product aggregation (in the figure, blue stands for 0 and yellow for 1). The two matrices display similar but not identical entries. On one hand, their similarity confirms the good MC prediction performance at a global level. On the other hand, their differences could be attributed to the high complexity of specific country/product pairs being predicted. In other words, there may be a discrepancy between the actual RCA value of a country/product pair and its potential RCA value, predicted by MC on the basis of similar country/product pairs.

The MONEY index. Results in the previous section highlight that MC reaches a good performance in predicting the RCA values of country-product pairs. However, some countries have more false positive predictions than others. Figure 3 displays the false positive rate $\mathrm{fpr}_c$ for each country considered in the analysis.
Surprisingly, the ranking of countries by false positive rate turns out to be quite similar to the one generated by the GENEPY economic complexity index (Kendall rank correlation coefficient τ k = 0.75 ). This is reassuring since the key insight of the MONEY index is that the activities of more complex countries are more difficult to predict, as confirmed by the strong correlation between the false positive rate of MC and GENEPY.
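The error rates and summary metrics used in this section (false positive rate, false negative rate, balanced accuracy, and F1 score) can all be obtained from the four confusion counts of a binary classifier. The following sketch is only illustrative: the function name `classification_metrics` and the toy labels are hypothetical, and the code assumes that both classes appear among the true labels.

```python
def classification_metrics(y_true, y_pred):
    """Rates and scores of a binary classifier from lists of 0/1 labels.

    Assumes at least one true positive-class and one true negative-class example.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return {
        "fpr": fp / (fp + tn),                 # false positive rate
        "fnr": fn / (fn + tp),                 # false negative rate
        "bacc": (tpr + tnr) / 2,               # balanced accuracy
        "f1": 2 * tp / (2 * tp + fp + fn),     # harmonic mean of precision and recall
    }

# Toy example with unbalanced classes (3 positives, 7 negatives).
toy_metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
)
```

Applying the same function separately to the entries of each row gives per-country rates comparable to the false positive rates discussed above.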
This section also reports the ranking of countries by economic complexity as expressed by the MONEY index. Results are compared to other popular measures of economic complexity (see Fig. 4): ECI from Hidalgo and Hausmann 1 , Fitness from Tacchella et al. 13 , and GENEPY from Sciarra et al. 4 . Overall, the rankings are similar, with some remarkable differences for the MONEY ranking. Australia ranks much higher (and in the top positions) according to MONEY than according to all other indices. In the bottom part of the MONEY ranking, we find Malaysia, which ranks higher based on the other economic complexity measures.

Despite some relevant differences, economic complexity indices turn out to be quite correlated. Table 1 reports the values of the Kendall rank correlation coefficient $\tau_k$ when comparing the rankings (world trade at the HS-4 level) provided by ECI, Fitness, GENEPY, MONEY, and GDP per capita at Purchasing Power Parity (PPP) for the year 2018. The level of correlation of MONEY with the other economic complexity indices and with GDP per capita (PPP) is similar to the one of ECI, Fitness, and GENEPY.
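The rank comparison reported above relies on the Kendall rank correlation coefficient, which counts concordant and discordant pairs of countries. A minimal pure-Python sketch is given below; the tau-b variant (which corrects for ties) is an assumption, since the text does not specify which variant was used, and the function name `kendall_tau_b` is hypothetical.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b variant) between two equally long score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                 # tied in both lists: not counted
            elif dx == 0:
                ties_x += 1              # tied only in x
            elif dy == 0:
                ties_y += 1              # tied only in y
            elif dx * dy > 0:
                concordant += 1          # same ordering in both lists
            else:
                discordant += 1          # opposite ordering
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")
```

For two index vectors over the same countries, a value close to 1 means that the two indices order the countries in almost the same way. The double loop is quadratic in the number of countries, which is negligible for 119 countries.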
In the following we take a closer look at the differences between MONEY and GENEPY (see Fig. 5a). Both indices are normalized to the interval [0, 1]. In Fig. 5a, countries are colored according to their MONEY values, which are proportional to the shade of blue: the color map ranges from the least complex countries (colored in white) to the most complex ones (colored in dark blue). Figure 5b shows the difference between the normalized values of MONEY and GENEPY. A different color map is used in Fig. 5b, due to its different meaning with respect to Fig. 5a. Since developed countries tend to have similar export baskets (Sciarra et al. 4 ), the predictability of their RCA values via MC might be higher, hence reducing their MONEY index to some extent.
Both the GENEPY and MONEY indices arise from the attempt to reconstruct (in a different way for each method) a matrix related to trade flows. In the case of GENEPY, the matrix is a proximity matrix N derived from the incidence matrix M (see the Supplementary material for the definition of the matrix N), and its reconstruction is obtained as a nonlinear least-squares estimate based on the components of the first two (normalized) eigenvectors of that matrix. The method then evaluates how the quality of the estimate changes when specific components of these eigenvectors (the ones associated with a given country) are dropped. In our case, the matrix A is obtained as a discretization of the RCA matrix. MC is then applied several times to the matrix A to reconstruct a portion of that matrix which has been obscured, in the attempt to uncover a "latent" similarity between countries, which can be useful for the prediction of RCA entries. Another difference is that the matrix reconstruction on which GENEPY is based relies only on two eigenvectors of N, whereas our method, being based on MC, exploits a typically much larger number of pairs of left- and right-singular vectors to build the reconstructed matrix, for each application of MC. The number of such pairs is chosen automatically by the adopted validation procedure. Further comparative results are available in the Supplementary material for different years and aggregation levels.

Differences in the GENEPY indices based on the original incidence matrix M and on M (MC)
Another potential use of MC is to predict the missing values in the trade dataset, similarly to Metulini et al. 14. Missing values amount to around 25% of the entries of the M matrix. Predicting them implies using, for the originally NaN entries of that matrix, the most frequent classifications derived from the applications of MC. To quantify the correlation between the GENEPY rankings computed based on M and M (MC), respectively, we evaluated their Kendall rank correlation coefficient τ_k. The statistical test produced τ_k ≃ 0.8 with a p-value near 0, so the null hypothesis of independence between GENEPY and GENEPY (MC) is rejected. With a few exceptions (China, France, Italy, UK and Germany), the more complex the country according to GENEPY, the higher the difference between GENEPY and GENEPY (MC). Although machine learning opens up new possible solutions for the imputation of missing values, more progress is needed to refine the performance of MC for missing-value imputation, possibly by combining it with other machine learning techniques (Metulini et al. 14).
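The idea of filling the originally missing entries of M with their most frequent classification can be sketched as a majority vote over several binary reconstructions. This is only an illustrative sketch: the function name `impute_by_majority`, the use of `None` for missing entries, and the toy matrices are assumptions made here, not the procedure or data of the paper.

```python
from collections import Counter


def impute_by_majority(m, runs):
    """Fill None entries of the binary matrix m with the most frequent
    class predicted for that entry across several MC runs."""
    filled = [row[:] for row in m]
    for i, row in enumerate(m):
        for j, value in enumerate(row):
            if value is None:
                votes = Counter(run[i][j] for run in runs)
                filled[i][j] = votes.most_common(1)[0][0]
    return filled


# Hypothetical 2 x 2 incidence matrix with two missing entries.
m = [[1, None], [None, 0]]
# Hypothetical binary outputs of three MC applications.
runs = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
]
m_mc = impute_by_majority(m, runs)
```

Observed entries are left untouched; only the missing ones take the majority class. Using an odd number of runs avoids ties between the two classes.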

Discussion
Machine learning has an enormous potential to enhance the quality of the prediction of economic growth and competitiveness (Longo et al. 15). In the present work, we applied Matrix Completion (MC) to investigate the economic complexity of countries in various ways. First, we assessed the high accuracy of the MC predictions when MC was applied to reconstruct the Revealed Comparative Advantage (RCA) matrix, which is at the basis of the construction of several existing economic complexity indices (see the Supplementary material). Then, we proposed the Matrix cOmpletion iNdex of Economic complexitY (MONEY), based on the degree of predictability of the RCA entries associated with different countries. The MONEY index is based on the idea that complex economic systems are more difficult to predict. As an additional contribution, we compared MC with recently developed economic complexity indices, to assess the expected economic complexity of countries. For example, MC was exploited to infer the expected discretized RCA of a country c for a product p. The MC technique employed is based on a soft-thresholded SVD. Combined with the MC validation phase, this allows the automatic selection of a suitable number of singular vectors to be used to reconstruct the RCA matrix. Differently from previous economic complexity indices, the information extracted by MC is not restricted to the first two singular vectors: a suitable number of singular vectors is selected to optimize the out-of-sample prediction performance.
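To make the role of the soft-thresholded SVD more concrete, the following minimal sketch iterates an SVD whose singular values are shrunk by a regularization parameter, in the spirit of Soft-Impute-type algorithms. It is a sketch under stated assumptions rather than the implementation used in this work: the function name `soft_impute`, the fixed value of `lam`, and the toy rank-one matrix are hypothetical, and the validation phase that selects the regularization parameter is not reproduced.

```python
import numpy as np


def soft_impute(x, observed, lam, n_iter=200, tol=1e-6):
    """Complete x (valid where `observed` is True) by iterating a
    soft-thresholded SVD. Returns the completed matrix and the number
    of singular values that survive the thresholding."""
    z = np.zeros(x.shape, dtype=float)
    rank = 0
    for _ in range(n_iter):
        # Keep the observed entries, fill the others with the current estimate.
        filled = np.where(observed, x, z)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - lam, 0.0)  # soft-thresholding step
        rank = int(np.count_nonzero(s_shrunk))
        z_new = (u * s_shrunk) @ vt
        change = np.linalg.norm(z_new - z)
        z = z_new
        if change <= tol * max(np.linalg.norm(z), 1.0):
            break
    return z, rank


# Hypothetical rank-one matrix with one obscured entry (not trade data).
rows = np.array([1.0, 1.5, 2.0, 1.0, 1.5, 2.0])
cols = np.array([2.0, 1.0, 1.5, 1.0, 2.0])
a = np.outer(rows, cols)
observed = np.ones(a.shape, dtype=bool)
observed[0, 0] = False
a_obs = a.copy()
a_obs[0, 0] = np.nan  # the obscured value is never read

z, rank = soft_impute(a_obs, observed, lam=0.5)
```

Singular values below `lam` are set to zero, so the number of singular vector pairs used in the reconstruction follows from the regularization parameter instead of being fixed in advance. In this toy case a single pair survives and the obscured entry is recovered approximately, with a slight shrinkage towards zero caused by the thresholding.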
Our results highlighted a good performance of MC in discerning country-product pairs with RCA values greater than or equal to the critical threshold of 1, denoting the expected competitiveness of country c in exporting product p. The outcomes were summarized by reporting the global ROC curve and comparing the heat-map of the true incidence matrix M with that of its MC surrogate matrix M (MC), which was obtained from various applications of MC. Motivated by the high MC accuracy, we developed the MONEY index taking into account both the predictive performance of MC for each country (as measured by its AUC_c) and the product dimension. In other words, when constructing that index, each AUC_c was weighted by the average of the ftot_p values over a subset of products associated with the specific country. The MONEY index can be used as a measure of economic complexity to predict the future growth potential of countries. Moreover, MC can help to deal with missing values when the missingness pattern is not random. More generally, future research is needed to further increase the quality of the predictions of machine learning methods when applied to economic complexity.
The maps were generated using the MATLAB 2012b package borders, available for free (upon registration) at the following hyperlink: https://it.mathworks.com/matlabcentral/fileexchange/50390-borders.
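The per-country predictive performance AUC_c mentioned in the Discussion can be illustrated with a rank-based computation of the area under the ROC curve, applied separately to each country's row. This is a minimal sketch on hypothetical toy data: the function names and the scores are assumptions, and the weighting by product-level quantities that enters the MONEY index is not reproduced here.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_country(true_rows, score_rows):
    """Compute one AUC per country from its true labels and MC scores."""
    return {c: auc(true_rows[c], score_rows[c]) for c in true_rows}


# Hypothetical binary RCA labels and MC scores for two countries.
true_rows = {"A": [1, 0, 1, 0], "B": [1, 1, 0, 0]}
score_rows = {"A": [0.9, 0.2, 0.8, 0.1], "B": [0.6, 0.3, 0.5, 0.2]}
aucs = auc_by_country(true_rows, score_rows)
```

In this toy case country A is perfectly predictable, while for country B one of the four positive-negative pairs is ranked incorrectly. Following the idea behind the index, a lower AUC_c signals lower predictability.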

Data availability
The RCA values used in our analysis come from the CEPII-BACI dataset, which is freely distributed under the Etalab Open License 2.0, and can be retrieved at the following hyperlink: http://www.cepii.fr/cepii/en/bdd_modele/bdd.asp.